fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map#5450
Conversation
Single-rank LAMMPS .pt2 inference for a message-passing model (DPA2,
DPA3, hybrids over those) silently relied on LAMMPS atom-map to populate
``InputNlist.mapping`` — without ``atom_modify map yes`` the C++ side
fell into an identity-mapping fallback (``DeepPotPTExpt.cc:374-384``)
whose values are wrong for ghost slots ``[nloc, nall)``. The model
graph's ``_exchange_ghosts`` (``deepmd/dpmodel/descriptor/repformers.py``)
then performed ``take_along_axis(g1[1, nloc, dim], mapping_tiled)`` with
out-of-bounds gather indices for ghosts, producing a CUDA index assert
(reproduced by the user on a DPA4 model) or undefined CPU output.
Multi-rank LAMMPS without a with-comm AOTI artifact has the same class
of failure: ``pair_deepmd.cpp:243`` only populates ``lmp_list.mapping``
for ``nprocs == 1``, so the regular path always misses the ghost mapping
under multi-rank.
Both unsupported combinations now fail-fast with an actionable message
instead of silently corrupting ghost features.
Files:
* deepmd/pt_expt/utils/serialization.py — emit ``has_message_passing``
in .pt2 metadata, mirroring the descriptor's ``has_message_passing()``
API (true for DPA2 / DPA3 / hybrids over those; false for se_e2_a /
DPA1).
* source/api_cc/{include,src}/DeepPotPTExpt.{h,cc} and DeepSpinPTExpt
— read the metadata into ``has_message_passing_``, gate the fail-fast
on it so non-GNN models retain their previous behaviour. Refined
predicate:
``has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0``
Two error messages: single-rank ("add atom_modify map yes") and
multi-rank ("re-export with use_loc_mapping=False"). Defaults to
``false`` for pre-PR .pt2 archives that lack the field, so non-GNN
archives continue to work; GNN archives must be regenerated to opt
into the fail-fast guard.
* source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py — new
``--no-atom-map`` flag that omits ``atom_modify map yes`` from the
LAMMPS input.
* source/lmp/tests/test_lammps_dpa3_pt2.py — four-cell coverage matrix:
A : test_pair_deepmd (existing)
B : test_pair_deepmd_no_atom_map_fails_fast (new)
C : test_pair_deepmd_with_comm (new)
D : test_pair_deepmd_with_comm_no_atom_map_fails_fast (new)
C-mr: test_pair_deepmd_mpi_dpa3 (existing)
B-mr: test_pair_deepmd_mpi_no_with_comm_fails_fast (new)
D-mr: test_pair_deepmd_mpi_no_atom_map (new)
Investigation note: the ``test_deeppot_dpa_ptexpt.cc`` C++ ctest is
misleadingly named — despite the "Dpa" prefix it loads
``deeppot_dpa1.pt2`` (DPA1, non-message-passing), so its regular .pt2
graph never consumes ``mapping`` for ghost gather and identity fallback
is trivially safe. The genuinely-DPA2 ctest is
``test_deeppot_dpa2_ptexpt.cc``, which already explicitly sets
``inlist.mapping = mapping.data();`` on every ``cpu_lmp_nlist*`` case.
No C++ ctest fixtures need to be edited by this PR — the metadata-gated
fail-fast correctly skips non-message-passing models.
Local verification: 160/160 ptexpt ctests pass against freshly-regenerated
.pt2 fixtures (the new metadata field is written by all gen_*.py
scripts). The negative cells B/B-mr/D/D-mr fail-fast paths are
exercised only via the LAMMPS Python tests in CI.
Known limitations:
- Multi-rank DPA3 ``use_loc_mapping=True`` is permanently unsupported;
the fail-fast surfaces this clearly.
- Single-rank with-comm-artifact + no atom-map (cell D) could be made
to work by populating a synthetic self-mirror comm_dict; deferred.
- ``MPI_Comm_size`` is not used as the multi-rank predicate because
api_cc does not link MPI; ``lmp_list.nswap > 0`` serves as the proxy
(equivalent for all current LAMMPS configurations).
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughSerializer adds ChangesMessage Passing Metadata Export and Dispatch Guard
Sequence DiagramsequenceDiagram
participant Serializer
participant Init as Runtime::init
participant Compute as Runtime::compute(lmp_list)
participant Dispatch as DispatchGate
participant WithComm as run_model_with_comm
participant Regular as run_model
participant Error as FailFastError
Serializer->>Init: write has_message_passing metadata
Init->>Compute: set has_message_passing_
Compute->>Compute: derive multi_rank, check mapping presence
Compute->>Dispatch: set use_with_comm = has_comm_artifact_ && multi_rank
alt has_message_passing_ && nghost > 0 && no mapping
Dispatch->>Error: throw rank-specific error
else use_with_comm
Dispatch->>WithComm: call run_model_with_comm
else
Dispatch->>Regular: call run_model
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 917524709b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #5450 +/- ##
==========================================
+ Coverage 82.48% 82.50% +0.02%
==========================================
Files 830 830
Lines 88522 88559 +37
Branches 4232 4243 +11
==========================================
+ Hits 73015 73066 +51
+ Misses 14220 14201 -19
- Partials 1287 1292 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
… map no Address Codex review (PR deepmodeling#5450) and CI failure: 1. **Codex P1 (DeepPotPTExpt.cc, DeepSpinPTExpt.cc)**: tighten the fail-fast predicate from ``!use_with_comm && !atom_map_present`` to ``!use_with_comm && (multi_rank || !atom_map_present)`` Previously a hypothetical caller that populated ``lmp_list.mapping`` on a multi-rank invocation (any C++ user bypassing pair_deepmd's ``nprocs == 1`` gate, or a future patch that propagated atom-map under multi-rank) could slip past the guard and silently use the regular path's rank-local mapping to gather cross-rank ghosts — exactly the silent-corruption class this PR is meant to close. The new predicate fails-fast multi-rank unconditionally on atom_map_present. 2. **CI failure**: LAMMPS rejects ``atom_modify map no`` ("Illegal atom_modify map command argument no", src/atom.cpp:870). The valid keywords are ``array | hash | yes``; to leave the atom-map disabled, the command must simply be omitted (default for ``atom_style atomic``). Update both the pytest ``_lammps`` helper and the MPI runner script to omit the line when ``atom_map="no"`` / ``--no-atom-map`` is set instead of emitting the invalid syntax. Local verification: ``*PtExpt*`` ctest filter — 160/160 PASSED. Known limitation: the codex P1 fix is defensive — no in-tree LAMMPS caller can reach the ``multi_rank && atom_map_present`` combination because pair_deepmd.cpp:242 gates atom-map propagation on ``nprocs == 1``. A C++ gtest could synthesize the case via ``inlist.nswap = 1`` + ``inlist.mapping = …`` (precedent: test_with_comm_load_failure_ptexpt.cc); deferred from this commit to keep the PR focused.
Per review feedback: the previous predicate
has_message_passing_ && !use_with_comm && nghost > 0 &&
(multi_rank || !atom_map_present)
collapsed two distinct conditions into one boolean, making it hard to
trace which matrix cell each throw corresponds to.
Restructure as two side-by-side throws, each mirroring one row of the
PR's coverage matrix:
if (has_message_passing_ && nghost > 0) {
if (multi_rank && !has_comm_artifact_) throw multi-rank msg; // B-mr
if (!multi_rank && !atom_map_present) throw single-rank msg; // B / D
}
Functionally identical to the prior boolean (verified equivalence by
truth table; 160/160 ptexpt ctests pass), but the structure now maps
1-to-1 to the PR description's table. Applied symmetrically in
DeepPotPTExpt.cc and DeepSpinPTExpt.cc.
The four "no-throw" cells fall out naturally:
- non-GNN (has_message_passing_=false) → no throw
- nghost==0 (NoPbc / isolated cluster) → no throw
- multi-rank && has_comm_artifact_ (cells C-mr, D-mr) → no throw (with-comm)
- single-rank && atom_map_present (cells A, C) → no throw (regular w/ map)
Mirror the non-spin DPA3 coverage matrix in ``test_lammps_dpa3_pt2.py``
for the spin path (DeepSpinPTExpt / pair_deepspin), ensuring the
fail-fast guard against silent corruption fires symmetrically for
the spin GNN.
* source/tests/infer/gen_spin.py — add ``_build_dpa3_single_yaml``
producing a DPA3 spin model with ``use_loc_mapping=True``. Together
with the existing ``deeppot_dpa3_spin_mpi.pt2`` (use_loc_mapping=
False), this covers both with-comm and no-with-comm spin variants
needed for cells A/B/B-mr (single-artifact) and C/C-mr/D/D-mr
(dual-artifact).
* source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py — add
``--no-atom-map`` flag that omits ``atom_modify map yes`` from the
LAMMPS input (mirrors the non-spin runner; LAMMPS rejects
``atom_modify map no`` so omission is the only valid path).
* source/lmp/tests/test_lammps_spin_dpa3_pt2.py — extend
``_run_mpi_subprocess`` with ``pb_path`` and ``capture`` parameters
(mirrors the non-spin helper). Add six new tests covering the
spin matrix:
test_pair_deepspin_single_artifact_with_atom_map (cell A)
test_pair_deepspin_no_atom_map_fails_fast (cell B)
test_pair_deepspin_mpi_no_with_comm_fails_fast (cell B-mr)
test_pair_deepspin_with_comm_single_atom_map (cell C)
test_pair_deepspin_with_comm_no_atom_map_fails_fast (cell D)
test_pair_deepspin_mpi_no_atom_map (cell D-mr)
Cell C-mr remains covered by the existing
``test_pair_deepmd_mpi_dpa3_spin``.
Local verification: gen_spin.py exit 0 produces ``deeppot_dpa3_spin.pt2``
with metadata ``has_message_passing=true, has_comm_artifact=false,
is_spin=true`` — exactly the single-artifact-GNN-spin case for cells
A/B/B-mr. pytest --collect-only reports 9 tests total (3 existing +
6 new). Negative cells (B/B-mr/D) are CI-only — require LAMMPS-with-
deepmd runtime.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/lmp/tests/test_lammps_spin_dpa3_pt2.py`:
- Around line 458-476: The test function test_pair_deepspin_mpi_no_atom_map
currently compares potential energy, forces, and force magnitudes between
out_no_map and out_baseline but omits virial; add a parity assertion for the
virial tensor by comparing out_no_map["virial"] to out_baseline["virial"] (use
np.testing.assert_allclose with similar tolerances as forces, e.g., atol=1e-8,
rtol=0) so virial regressions are detected; place this assertion alongside the
existing force/force_mag checks referencing the same out_no_map and out_baseline
variables.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0f4fc365-5441-4066-8a47-36ef3404b756
📒 Files selected for processing (3)
source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.pysource/lmp/tests/test_lammps_spin_dpa3_pt2.pysource/tests/infer/gen_spin.py
njzjz-bot
left a comment
There was a problem hiding this comment.
Thanks, the new version addresses the earlier metadata probing/comment issues and the nprocs predicate is a good improvement over nswap for spin.
I left one blocking inline test-coverage comment: test_pair_deepspin_mpi_no_atom_map compares energy/forces/force_mag but still skips per-atom virials even though the runner parses them and the adjacent parity tests assert them. Please add the virial parity assertion before merge.
CI is otherwise green from GitHub checks.
— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
Replaces the ``lmp_list.nswap > 0`` proxy with a direct ``lmp_list.nprocs > 1`` predicate in DeepPotPTExpt / DeepSpinPTExpt dispatch. ``atom_style spin`` populates ``nswap`` even in single-rank runs (to propagate PBC ghost spins), so the proxy was unsound and caused single-rank spin LAMMPS jobs to be misclassified as multi-rank by the fail-fast guard. Also populates ``lmp_list.mapping`` from the LAMMPS atom-map in ``pair_deepspin.cpp`` (mirroring ``pair_deepmd.cpp``), so single-rank spin .pt2 GNN inference can resolve ghost indices. Mechanism: - Add ``int nprocs`` field + ``set_nprocs`` to ``InputNlist`` (and the matching ``DP_NlistSetNprocs`` / ``deepmd::hpp::InputNlist::set_nprocs`` wrappers) so LAMMPS pair styles can forward ``comm->nprocs``. - ``pair_deepmd.cpp``, ``pair_deepspin.cpp``, and ``fix_dplr.cpp`` now call ``lmp_list.set_nprocs(comm->nprocs)``. - Refactor the ``serialization.py`` ``has_message_passing`` probe to walk ``model -> atomic_model -> descriptor`` (per njzjz-bot review), so composite atomic models (e.g. LinearAtomicModel) report the flag. - Rewrite the stale ``world == nullptr`` carve-out comment in the identity-mapping fallback blocks.
Replaces the ``set_nprocs(comm->nprocs)`` setter with a constructor parameter on both ``InputNlist`` overloads (lightweight and comm-aware) and the matching ``DP_NewNlist`` / ``DP_NewNlist_comm`` C entry points plus the hpp wrappers. ``nprocs`` is a conceptual sibling of ``world`` and ``nswap`` and lives more naturally in the initializer than as a post-construction setter. Default value 1 keeps direct C++ API consumers source-compatible — they need not pass anything. - ``InputNlist`` lightweight ctor: new trailing ``int nprocs_ = 1``. - ``InputNlist`` comm-aware ctor: new trailing ``int nprocs_ = 1``. - ``DP_NewNlist`` / ``DP_NewNlist_comm``: new trailing ``int nprocs`` arg (no default — C API). - ``deepmd::hpp::InputNlist`` ctors: new trailing ``int nprocs = 1``. - ``DP_NlistSetNprocs`` / ``set_nprocs`` removed (single-use in-tree). - ``pair_deepmd.cpp``, ``pair_deepspin.cpp``, ``fix_dplr.cpp`` now pass ``comm->nprocs`` directly in the constructor call. - Update stale comments in ``DeepPotPTExpt.cc`` and ``DeepSpinPTExpt.cc`` to reference the constructor parameter instead of the removed setter.
d88a46a to
4fcd312
Compare
Reviewer ask (njzjz-bot / coderabbitai on PR deepmodeling#5450): the ``test_pair_deepspin_mpi_no_atom_map`` parity check compared energy, forces, and force_mag against the with-atom-map multi-rank baseline but skipped virials. Add the matching ``virials`` assertion so a regression in the spin with-comm path that affects only the virial tensor cannot pass silently. Other spin MPI parity tests in this file already assert virials at the same tolerance.
…ment The lightweight 4-arg ``InputNlist`` (and matching ``DP_NewNlist`` / hpp wrapper) takes neither ``world`` nor any swap arrays, so it cannot drive the with-comm dispatch path even when constructed in a multi-rank LAMMPS context. The earlier addition of an ``int nprocs_ = 1`` parameter was therefore write-only — passing ``comm->nprocs`` to it (as ``fix_dplr.cpp`` did) had no observable effect on dispatch. Drop the parameter to restore the original ABI of the 4-arg form and make the contract honest: this overload is always single-rank-classified. Multi-rank GNN dispatch goes through the 12-arg comm-aware constructor. The ``nprocs`` data member stays (default 1), settable only via the comm-aware constructor's trailing ``nprocs_`` argument. Also mirror the decision-matrix comment from ``DeepPotPTExpt.cc`` into ``DeepSpinPTExpt.cc`` so the spin path documents the four-cell rationale in-line rather than redirecting the reader to the non-spin file.
njzjz-bot
left a comment
There was a problem hiding this comment.
Re-reviewed latest tip d4a9e5d. The previous concerns have been addressed, CI is green except the still-running CUDA aggregate check, and I did not find any new blocker.
— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)
4604131
…ers (deepmodeling#5455) ## Summary - Adds an explicit `del lammps` before `MPI.Finalize()` in all four mpirun-driven LAMMPS test runners (`run_mpi_pair_deepmd.py`, `run_mpi_pair_deepmd_spin.py`, `run_mpi_pair_deepmd_dpa3_pt2.py`, `run_mpi_pair_deepmd_spin_dpa3_pt2.py`). - Fixes a teardown-order race that intermittently manifests as **subprocess exit code 136 (SIGFPE)** for `test_pair_deepmd_mpi_dpa3_spin_empty_subdomain` on the GitHub Actions CUDA runner image. ## Background Recent CI runs on multiple unrelated PRs (deepmodeling#5446, deepmodeling#5450) hit the identical failure signature: ``` short test summary info ============================ FAILED source/lmp/tests/test_lammps_spin_dpa3_pt2.py::test_pair_deepmd_mpi_dpa3_spin_empty_subdomain - subprocess.CalledProcessError: ... returned non-zero exit status 136. ``` - Reproduces ~1 in 5 runs on the GitHub Actions CUDA image (`nvidia/cuda:12.9.1-cudnn-devel-ubuntu22.04`). - **Does not reproduce on a V100 Bohrium dev box** — 60/60 consecutive passes. So it's a pre-existing flake, not caused by either of the recent PRs. ## Root cause (empirically confirmed) The runner ends with: ```python forces_global = lammps.lmp.gather_atoms(...) ... MPI.Finalize() ``` `lammps` is still alive when `MPI.Finalize()` returns. Python then garbage-collects it during interpreter shutdown, which triggers `LAMMPS::~LAMMPS` → `Finish::end()` → **`MPI_Allreduce`** for timing aggregation. By that time, MPI has already been finalized, which is undefined behavior. I instrumented the runner with timestamped prints to verify the order directly. Without the fix: ``` t=3311.770 R1: BEFORE MPI.Finalize t=3311.778 R0/R1: AFTER MPI.Finalize ← MPI is finalized t=3311.778 R0/R1: PY ATEXIT … process exit, LAMMPS destructor runs HERE ``` With the fix: ``` t=3423.100 R1: AFTER del lammps (LAMMPS destructor done) ← MPI still up t=3423.108 R0/R1: BEFORE MPI.Finalize t=3423.108 R0/R1: AFTER MPI.Finalize ``` So the LAMMPS destructor now runs while MPI is still up, which is what its `MPI_Allreduce`/`MPI_Gather` calls require. The reason this manifests as SIGFPE only on the CUDA CI image (not on V100) is most likely that the CI image (or one of its preloaded libraries) enables FP-exception trapping; on V100 the same MPI-after-Finalize errors return silently. The flake is environment-specific, but the underlying antipattern is unconditional and worth fixing in any environment. ## Test plan - [x] Local CPU: 29/29 LAMMPS tests pass (`test_lammps_dpa3_pt2.py`, `test_lammps_spin_dpa3_pt2.py`) - [x] Remote V100: 50/50 stress runs of the previously-failing test - [x] Empirical confirmation that the fix flips the LAMMPS-destructor-vs-MPI.Finalize ordering (see Background) - [ ] CI: re-run the spin LAMMPS suite multiple times to confirm the SIGFPE no longer appears ## Known limitations - Cannot directly observe the SIGFPE on V100, so the fix has not been observed *preventing* the actual crash — only correcting the antipattern that we have strong reason to believe causes it. - If the failure persists after merge, the next candidate root cause is CUDA stream destruction order, and we should revisit. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved MPI cleanup sequence in multiple test runners to prevent finalization-related crashes when executing tests in distributed MPI environments. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/deepmodeling/deepmd-kit/pull/5455?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
…ling#5467) ## Summary `TestChangeBias` is the dominant memory hog in `Test Python` shard `(10, 3.13)` of the CI matrix — by itself it peaks at **~5 GB RSS**, leaving so little headroom under the 7 GB GitHub-hosted runner that the shard intermittently loses communication with the GitHub Actions server. This causes the recurring `runner lost communication` failure that has affected many recent PRs (deepmodeling#5446, deepmodeling#5448, deepmodeling#5450, deepmodeling#5455, deepmodeling#5456, …). This PR shrinks the change-bias test dataset from 80 frames to 5 frames, dropping the class's peak RSS to **~1.7 GB** while keeping all 9 tests passing — including the strict `atol=1e-10` `pt2_pte_consistency` check. ## How I located it Local reproduction of shard `(10, 3.13)` (using the same `.test_durations` cache CI uses, so identical test partitioning): | Test profiled in isolation | Peak RSS | |---|---| | **`test_change_bias_frozen_pte`** | **5.04 GB** ← outlier | | `TestDeepEvalEnerPt2` class | 1.43 GB | | `TestDeepEvalEnerAparamPt2` class | 1.44 GB | | `TestSpinInference::test_get_use_spin` | 1.41 GB | | `test_finetune_from_pt2_use_pretrain_script` | 1.41 GB | | `test_training_loop_compiled` | 1.32 GB | | `test_export_pipeline` | 1.57 GB | | `test_descriptor_shape_dpa1` | 1.34 GB | Then phase-by-phase RSS profiling inside `dp change-bias` showed the 4.3 GB jump happens entirely inside `compute_output_stats` → `_compute_model_predict`. Scaling experiment confirms it: peak grows **linearly at ~50 MB per frame** of input data. | nbatches | Peak RSS | per-frame | |---|---|---| | 1 | 567 MB | — | | 5 | 781 MB | +43 MB | | 20 | 1583 MB | +53 MB | | 80 | 4797 MB | +53 MB | That's a leak in the `torch.no_grad()`-wrapped `forward_common_atomic` somewhere — separate from autograd. The water example has 80 frames at batch_size=1, so the CLI default `nbatches = min(data.get_nbatches()) = 80` triggers all 80 forwards in one go. ## Why I can't just pass `-n 5` `_load_batch_set` shuffles when it loads the set. If `nbatches < total_frames`, the loop samples a random subset — and the two calls in `test_change_bias_pt2_pte_consistency` (running in the **same Python process** via `main(cmds)`, with `dp_random`'s state advancing between calls) would see **different** subsets → different biases → the `atol=1e-10` assertion fails. `nbatches == total_frames` makes the forward enumerate **every** frame regardless of shuffle order, so the aggregate bias is invariant under shuffle. Determinism is preserved. ## The fix Build a 5-frame subset of `examples/water/data/data_0` in `TestChangeBias.setUpClass` and point both the trainer config and the change-bias `-s` argument at it. `nbatches` then resolves to 5 (= the new dataset size = full enumeration), and all 9 tests pass with peak RSS at ~1.7 GB. ## Test plan - [x] All 9 tests in `TestChangeBias` pass locally (CPU fp64): - `test_change_bias_with_data` - `test_change_bias_with_data_sys_file` - `test_change_bias_with_user_defined` - `test_change_bias_frozen_pte` - `test_change_bias_frozen_pt2` - `test_change_bias_frozen_pt2_user_defined` - `test_change_bias_pt2_pte_consistency` (atol=1e-10 — the determinism-sensitive one) - `test_change_bias_pte_preserves_model_def_script` - `test_change_bias_pt2_preserves_model_def_script` - [x] Peak RSS measurement (kernel `ru_maxrss`): - Before: 5.66 GB - After: **1.62 GB** (single test) / **1.75 GB** (whole class) - [ ] CI shard `(10, 3.13)` confirms no more `runner lost communication` on this branch (pending CI run) ## Known limitations - **The underlying ~50 MB/frame leak in `forward_common_atomic` remains as a production bug.** Users running `dp change-bias` on real datasets with thousands of frames will see multi-GB RSS growth. Worth a separate follow-up to find and patch the leak. - The 5-frame number is somewhat arbitrary. It's chosen as the smallest value that (a) keeps the assertion logic working, (b) leaves room for the least-squares regression (need ≥ ntypes = 2 frames). - `_make_subset_dataset` only handles `set.000`; multi-set datasets would need extension. Not needed for water/data_0. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Reduced the dataset used by a change-bias test to a small, fixed subset (5 frames) so test runs use far less disk space and complete faster. * Test setup now builds and points to the truncated dataset for all related invocations, lowering resource overhead during CI and local testing. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/deepmodeling/deepmd-kit/pull/5467?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
Summary
.pt2inference for message-passing models (DPA2, DPA3, hybrids over those) when the LAMMPS atom-map is not enabled. Previously the C++ side fell into an identity-mapping fallback (DeepPotPTExpt.cc:374-384) whose values are wrong for ghost slots; the model's_exchange_ghosts(deepmd/dpmodel/descriptor/repformers.py) then performedtake_along_axis(g1[1, nloc, dim], mapping_tiled)with out-of-bounds gather indices for ghosts — CUDA index assert in the user's DPA4 report, undefined CPU output otherwise.has_message_passingfield to .pt2 metadata (mirrors the descriptor'shas_message_passing()API: true for DPA2/DPA3/hybrids over those; false for se_e2_a/DPA1/etc.). Gate the fail-fast inDeepPotPTExpt::compute_innerandDeepSpinPTExpt::compute_inneron it. Non-GNN models retain their previous behaviour.atom_modify map yes…"use_loc_mapping=False…"has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0. Thenghost > 0guard skips NoPbc and isolated-cluster cases where identity over[0, nloc)is trivially correct.Four-cell coverage matrix in
test_lammps_dpa3_pt2.pyuse_loc_mappingtest_pair_deepmd(existing)test_pair_deepmd_no_atom_map_fails_fast(new)test_pair_deepmd_mpi_no_with_comm_fails_fast(new, subprocess)test_pair_deepmd_with_comm(new)border_op)test_pair_deepmd_mpi_dpa3(existing)test_pair_deepmd_with_comm_no_atom_map_fails_fast(new)test_pair_deepmd_mpi_no_atom_map(new, subprocess)Investigation note (resolves an earlier mystery)
test_deeppot_dpa_ptexpt.ccis misleadingly named — despite theDpaprefix it loadsdeeppot_dpa1.pt2(DPA1, non-message-passing). Its regular.pt2graph never consumesmappingfor ghost gather, so the identity fallback was trivially safe and the test passed without explicitinlist.mapping. The genuinely-DPA2 ctest istest_deeppot_dpa2_ptexpt.cc(different file), which already explicitly setsinlist.mapping = mapping.data();on allcpu_lmp_nlist*paths. No C++ ctest fixtures need editing in this PR — the metadata-gated fail-fast correctly skips DPA1.Backward compatibility
has_message_passing_defaults to false in C++ when the metadata field is missing — so pre-PR .pt2 archives retain their previous behaviour. Non-GNN pre-PR archives continue to work; GNN pre-PR archives must be regenerated to opt into the fail-fast guard. In-tree fixtures are generated bygen_*.pyat CI time, which always writes the new field.Test plan
*PtExpt*filter: 160 / 160 PASSED (270 s) against freshly-regenerated.pt2fixtures.pytest.raises(Exception, match=r\"atom_modify map yes\")and stdout/stderr substringuse_loc_mapping=False; if LAMMPS wraps the exception with a prefix/suffix differently than expected, the match may need adjustment.test_pair_deepmd_mpi_no_atom_map) verifies the with-comm artifact handles ghosts viaborder_opwithout consuming the mapping tensor.Known limitations
use_loc_mapping=Trueis permanently unsupported by this fix — the fail-fast surfaces it clearly, no path forward without re-export.comm_dict; deferred to a follow-up.MPI_Comm_sizeis not used as the multi-rank predicate becauseapi_ccdoes not link MPI directly;lmp_list.nswap > 0serves as the proxy (equivalent for all current LAMMPS configurations).use_loc_mapping=Truearchives lacking the new metadata field continue to exhibit the silent-corruption bug — users must regenerate.Summary by CodeRabbit
New Features
Bug Fixes
Tests